IMPROVED METHODS OF SEQUENCE ASSEMBLY
IN SEQUENCING BY HYBRIDIZATION 

FIELD OF THE INVENTION 

The invention relates generally to novel methods, materials and devices for
nucleic acid sequence analysis by hybridization, in which the sequence information
obtainable not only from perfectly matched oligonucleotide probes but also from
oligonucleotide probes that are not perfectly matched to the target nucleic acid is
taken into account.

BACKGROUND 

The rate of determining the sequence of the four nucleotides in nucleic acid
samples is a major technical obstacle for further advancement of molecular biology,
medicine, and biotechnology. Nucleic acid sequencing methods which involve
separation of nucleic acid molecules in a gel have been in use since 1978. The other
proven method for sequencing nucleic acids is sequencing by hybridization (SBH).

The traditional method of determining a sequence of nucleotides (i.e., the
order of the A, G, C and T nucleotides in a sample) is performed by preparing a
mixture of randomly terminated, differentially labelled nucleic acid fragments by
degradation at specific nucleotides, or by dideoxy chain termination of replicating
strands. Resulting nucleic acid fragments in the range of 1 to 500 bp are then
separated on a gel to produce a ladder of bands wherein the adjacent samples differ
in length by one nucleotide.

The array based approach of SBH does not require single base resolution in
separation, degradation, synthesis or imaging of a nucleic acid molecule. Using
mismatch discriminative hybridization of short oligonucleotides K bases in length,
lists of constituent K-mer oligonucleotides may be determined for a target nucleic
acid. Sequence for the target nucleic acid may be assembled by uniquely
overlapping scored oligonucleotides.



WO 00/79007 PCT/US00/16899

There are several approaches available to achieve sequencing by
hybridization. In a process called SBH Format 1, nucleic acid samples are arrayed,
and labeled probes are hybridized with the samples. Replica membranes with the
same sets of sample nucleic acids may be used for parallel scoring of several probes
and/or probes may be multiplexed. Nucleic acid samples may be arrayed and
hybridized on nylon membranes or other suitable supports. Each membrane array
may be reused many times. Format 1 is especially efficient for batch processing
large numbers of samples.

In SBH Format 2, probes are arrayed at locations on a substrate which
correspond to their respective sequences, and a labelled nucleic acid sample
fragment is hybridized to the arrayed probes. In this case, sequence information
about a fragment may be determined in a simultaneous hybridization reaction with
all of the arrayed probes. For sequencing other nucleic acid fragments, the same
oligonucleotide array may be reused. The arrays may be produced by spotting or
by in situ synthesis of probes.

In Format 3 SBH, two sets of probes are used. In one embodiment, one set
may be in the form of arrays of probes with known positions, and another, labelled
set may be stored in multiwell plates. In this case, target nucleic acid need not be
labelled. Target nucleic acid and one or more labelled probes are added to the
arrayed sets of probes. If one attached probe and one labelled probe both hybridize
contiguously on the target nucleic acid, they are covalently ligated, producing a
detected sequence whose length equals the sum of the lengths of the ligated probes.
The process allows for sequencing long nucleic acid fragments, e.g. a complete
bacterial genome, without nucleic acid subcloning in smaller pieces.

However, to sequence long nucleic acids unambiguously, SBH involves the
use of long probes. As the length of the probes increases, so does the number of
probes required to generate sequence information. Each 2-fold increase in length
of the target requires a one-base increase in the length of the probe, resulting in a
four-fold increase in the number of probes required (the complete set of all possible
sequences of probes of length k contains 4^k probes). For example, sequencing 100
bases of DNA requires 16,384 7-mers; sequencing 200 bases requires 65,536
8-mers; 400 bases, 262,144 9-mers; 800 bases, 1,048,576 10-mers; 1600 bases,
4,194,304 11-mers; 3200 bases, 16,777,216 12-mers; 6400 bases, 67,108,864
13-mers; and 12,800 bases requires 268,435,456 14-mers.
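
The scaling quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the disclosed method; the function names are illustrative, and the rule is simply anchored on the quoted figure of 7-mers for 100 bases with one extra base per doubling of target length.

```python
def probe_set_size(k: int) -> int:
    """Number of distinct probes in the complete set of k-mers over A, C, G, T."""
    return 4 ** k


def min_probe_length(target_length: int, base_length: int = 100, base_k: int = 7) -> int:
    """Probe length for a target, adding one base for each doubling of target length.

    The defaults (7-mers for 100 bases) are taken from the figures quoted in the text.
    """
    k = base_k
    covered = base_length
    while covered < target_length:
        covered *= 2
        k += 1
    return k


if __name__ == "__main__":
    for length in (100, 200, 400, 800, 1600, 3200, 6400, 12800):
        k = min_probe_length(length)
        print(f"{length:>6} bases -> {k}-mers, {probe_set_size(k):,} probes")
```

Each step up in probe length multiplies the size of the complete probe set by four, which is why the number of hybridization reactions becomes the practical bottleneck.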

Because a limited number of probes can be scored in each array-based 
hybridization reaction, use of an extremely large number of probes requires 
carrying out multiple hybridization reactions. 

An improvement in SBH that reduces costs and increases accuracy would 
greatly enhance the practical ability to sequence long pieces of polynucleotides de 
novo. Such an improvement would, of course, also enhance resequencing and 
other applications of SBH. Thus, there remains a need for additional and improved 
methods and materials for performing sequence analysis by hybridization. 

SUMMARY OF THE INVENTION

The present invention provides novel methods and materials, including
apparatus, for performing sequence analysis by hybridization (referred to herein as
"SBH"). Conventional methods of SBH utilize hybridization conditions selected to
discriminate probe:target hybrids that are perfectly complementary in the
information region (informative region) of the probe from probe:target hybrids that
have even a single base pair mismatch. Conventional methods also assemble
sequence information using a scoring system for the probes that gives a "positive"
score to fully matched (perfectly complementary) probes and a "negative" score to
all other probes (i.e., probes with a single, double, or more base pair mismatch
compared to target).

According to the present invention, the efficiency, sensitivity and accuracy
of these methods are improved by taking into account the sequence information that
is obtainable not only from probes that are perfectly matched to the target nucleic
acid sequence, but also from probes that have single, double or more mismatches
compared to the target.

The present invention provides methods for analyzing the sequence of a
target nucleic acid, comprising the steps of:

(a) contacting a target nucleic acid with a plurality of oligonucleotide
probes of predetermined length and predetermined sequence, wherein each probe
comprises an information region, under conditions which produce, on average,
relatively more probe:target hybrids per probe for probes that are perfectly
complementary in the information region of the probe than for probes that are
substantially perfectly complementary in the information region of the probe, and
relatively fewer probe:target hybrids that are significantly mismatched in the
information region of the probe;

(b) measuring the hybridization signal of said probes with the target
nucleic acid;

(c) assigning a numerical voting score to each probe or pool of probes
based on the relative strength of hybridization signal; and

(d) determining the sequence of the target nucleic acid, comprising the step
of summing the numerical voting scores of the probes in relation to their
sequences.

Optionally, after step (c), the numerical voting score of the probe or pool of
probes may be modified by a voting factor selected based on the relationship of the
probe to the hypothetical sequence.

The number of probes used in the hybridization step may be at least about
10, at least about 100, at least about 1000, at least about 10^4, at least about 10^5,
at least about 10^6, or at least about 10^7 different probes (meaning the number of
distinct probe sequences), and may potentially range up to about 10^10 different
probes or even more.

The plurality of probes may be hybridized individually with the target or in
groups or pools of probes. Probes may optionally be associated with or labeled
with identification tags. Each of the probes may be labeled with a unique
identification tag; alternatively, a pool or a part of a pool of probes may be labeled
with the same identification tag.

Another aspect of the invention provides an improvement over
conventional SBH methods wherein a subpool of probes is collectively given one
score, and wherein the subpool is differentiated from other subpools within the
pool via labeling with distinct identification tags. Pools and subpools can be
formed either with respect to probes in solution or with respect to probes in fixed
arrays.

For example, a pool of 1024 pentamer probes is divided into 4 pools of 256
pentamer probes each. In each pool of 256 probes, the probes are labeled with one
of four different identification tags, so that one subpool of 64 probes will bear one
tag, another subpool of 64 probes will bear a second tag, a further subpool of 64
probes will bear a third tag, and a final subpool of 64 probes will bear a fourth tag.
This can easily be accomplished by dividing the starting number of 1024 probes
into four pools of 256 probes, labeling each pool with one of four tags, dividing the
pools into four aliquots of 64 probes each, and combining an aliquot from each of
the four pools to create a mixture of 256 probes labeled with 4 different tags. By
creating subpools within a physical pool of probes through differential labeling of
probes with identification tags, the advantages of using smaller subpools of probes
can be obtained without requiring physical separation of pools into subpools. Any
suitable identification tag known in the art can be used; presently preferred tags are
fluorescent labels of different wavelengths or "colors." This aspect of the invention
is illustrated in Figure 1. These subpools created through differential labeling can
be used in the methods of the present invention, can be used as "informative
pools", and can also be used in any SBH methods (including Format 1, Format 2
and Format 3 methods) known in the art and for any SBH applications known in
the art, including de novo sequencing, resequencing, polymorphism detection, etc.,
but are preferably utilized in resequencing applications.

Thus, this aspect of the invention provides methods of sequencing by
hybridization wherein the method comprises an additional step of detecting label
(identification tag). The label detection step may occur before, after, or
concurrently with the steps of detecting/measuring hybridization signal.

This aspect of the invention also provides methods of sequencing by
hybridization using subsets of probes that have been labeled with different tags and
pooled in one or more pools by mixing all or a number of probes from each subset.
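
The label-then-mix procedure in the pentamer example can be sketched as follows. This is only an illustration under stated assumptions: the tag names are hypothetical placeholders for distinguishable labels such as fluorescent colors, and the probes are partitioned in simple lexical order because the text does not specify which probes go into which group.

```python
from itertools import product

# Hypothetical tag names standing in for four distinguishable labels.
TAGS = ("tag1", "tag2", "tag3", "tag4")


def build_tagged_pools(k: int = 5):
    """Return a list of pools; each pool maps probe sequence -> identification tag."""
    probes = ["".join(p) for p in product("ACGT", repeat=k)]  # 1024 pentamers for k=5
    n_tags = len(TAGS)
    group_size = len(probes) // n_tags   # 256 probes per labeled group
    aliquot = group_size // n_tags       # 64 probes per aliquot

    # Step 1: divide the probes into four groups and label each group with one tag.
    # (Lexical partitioning is an assumption made only for this sketch.)
    groups = [probes[i * group_size:(i + 1) * group_size] for i in range(n_tags)]

    # Step 2: split every labeled group into four aliquots, then combine one aliquot
    # from each group so that each final pool carries four differently tagged subpools.
    pools = []
    for j in range(n_tags):
        pool = {}
        for tag, group in zip(TAGS, groups):
            for probe in group[j * aliquot:(j + 1) * aliquot]:
                pool[probe] = tag
        pools.append(pool)
    return pools


if __name__ == "__main__":
    for index, pool in enumerate(build_tagged_pools()):
        counts = {tag: sum(1 for t in pool.values() if t == tag) for tag in TAGS}
        print(index, len(pool), counts)
```

Each resulting pool holds 256 probes in four subpools of 64, so a reader that distinguishes the four tags can score each subpool separately without physically splitting the pool.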

This aspect of the invention further provides a pool of probes comprising a
mixture of at least 100, or at least 200, or at least 300, or at least 400 distinct
probes, each probe being associated with an identification tag. The probes in the
pool may all be labeled with the same tag, or may be divided into two, three, four
or more subpools in which all probes in a subpool are labeled with the same tag,
but each subpool is associated with a different tag. The entire subpool is given one
collective score, and when sequence information is assembled it is assumed that
each probe within the subpool is assigned the same score, which may be either a
positive/negative score or a numerical voting score.

Another aspect of the invention provides an apparatus comprising means 
for carrying out the hybridization step and means for carrying out the detecting 
and/or measuring step(s), as described above. 

A further aspect of the invention provides an apparatus comprising means
for carrying out the sequence analysis step, e.g. a computer programmed as set
forth in Appendix A herein.

Examples of applications that require very large numbers of probes are: (1)
sequencing or resequencing of the entire human genome and other complex
genomes, (2) sequencing or resequencing of total mRNA or cDNA in a human or
other complex cell, (3) genotyping thousands or millions of single nucleotide
polymorphisms in individual human genomes, and (4) de novo sequencing of
thousands of bases.

Numerous additional aspects and advantages of the invention will become
apparent to those skilled in the art upon consideration of the following detailed
description of the invention which describes presently preferred embodiments
thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a pooling schema for the use of subpools in Format 3
sequencing by hybridization. Instead of the arrays of hexamers noted in the figure,
arrays of 1024 pentamers, or arrays of 1024 pools of 4 hexamers (4096 hexamers)
or arrays of 4096 pools of 4 heptamers may be used. The read length depends on
the number of arrayed probes or arrayed probe pools and can range from 1 kb to
several kb or more.

Figure 2 depicts the result of sequence analysis of human apo-B gene in
which probes were assigned numerical voting scores based on strength of
hybridization, the voting scores were further modified by a voting factor based on
relationship of the probe sequence to the hypothetical target sequence, and
sequence information was assembled using the informational content of both fully
matched and mismatched probes.

DETAILED DESCRIPTION OF THE INVENTION

The three major steps of conventional SBH are biochemical hybridization
of probes to target polynucleotide, detection of positive results, and informational
sequence assembly from the results. In the biochemical hybridization step, a set of
oligonucleotide probes of known sequence is allowed to hybridize with a target
polynucleotide of unknown sequence. In the detection step, a subset of these
probes is scored as positive (predominantly those that hybridize to and fully match
the target polynucleotide sequence). In the informational step, the target sequence
is assembled using different algorithms, usually executed by computer programs,
that uniquely overlap all sequences of the real (or true) positive probe subset.

Conventional SBH is a well developed technology that may be practiced by
a number of methods known to those skilled in the art. For example, variations of
techniques related to sequencing by hybridization are described in the following
documents, all of which are incorporated by reference herein: Drmanac et al., U.S.
Patent No. 5,202,231 (hereby incorporated by reference herein) - Issued April 13,
1993; Drmanac et al., U.S. Patent No. 5,525,464 (hereby incorporated by
reference herein) - Issued June 11, 1996; Drmanac, PCT Patent Appln. No. WO
95/09248 (hereby incorporated by reference); Drmanac et al., Genomics, 4,
114-128 (1989); Drmanac et al., Proceedings of the First Intl. Conf.
Electrophoresis Supercomputing Human Genome, Cantor et al. eds, World
Scientific Pub. Co., Singapore, 47-59 (1991); Drmanac et al., Science, 260,
1649-1652 (1993); Lehrach et al., Genome Analysis: Genetic and Physical
Mapping, 1, 39-81 (1990), Cold Spring Harbor Laboratory Press; Drmanac et al.,
Nucl. Acids Res., 4691 (1986); Stevanovic et al., Gene, 79, 139 (1989); Panusku
et al., Mol. Biol. Evol., 1, 607 (1990); Nizetic et al., Nucl. Acids Res., 19, 182
(1991); Drmanac et al., J. Biomol. Struct. Dyn., 5, 1085 (1991); Hoheisel et al.,
Mol. Gen., 4, 125-132 (1991); Strezoska et al., Proc. Natl. Acad. Sci. (USA), 88,
10089 (1991); Drmanac et al., Nucl. Acids Res., 19, 5839 (1991); and Drmanac et
al., Int. J. Genome Res., 1, 59-79 (1992), all of which are incorporated by
reference herein.

Conventional SBH approaches use arrays of target samples which are
hybridized to labeled probes (Format 1), or arrays of probes which are hybridized
to labeled target samples (Format 2), for efficient parallel scoring of multiple
hybridization events. In one approach, either target samples or probes are attached
to solid supports in the form of beads that serve to separate parallel hybridization
reactions in the reading or detection step. Beads or other markers can be used as
tags to identify probes. Mass spectrometry technology can also be used to
distinguish probe species on the basis of their mass even when the probes are not
tagged. In a Format 3 type SBH method, two sets of shorter probes are
combinatorially connected to simulate a much larger set of longer probes.
Typically, in Format 3, the first set of probes is fixed in the array, and the second
set of probes is labeled. The labeled probes are added together with target nucleic
acid, and a labeled probe is ligated to a fixed probe only if the two probes hybridize
contiguously on the target nucleic acid.

Format 1, 2 and 3 SBH methods are described in further detail below. In
addition, a set of probes can be scored in the form of informative pools with
minimal loss of information, as described in U.S. Application Serial No.
60/115,284 entitled "Enhanced Sequencing by Hybridization Using Informative
Pools of Probes" filed January 6, 1999, incorporated herein by reference. Other
types of pools may be used. In addition to pooling probes that cannot be
distinguished in the reading step, probes with unique tags can be multiplexed in the
same hybridization reaction. Probes can be individually synthesized in separate
reactions or in situ, or a combination of two much smaller sets of shorter probes
may be used to score a much larger set of longer probes (1024 5-mers x 1024
5-mers = 1,048,576 10-mers).
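
The combinatorial arithmetic above, and the idea that a fixed and a labeled 5-mer together read one 10-mer, can be sketched as below. This is an illustration only: it works directly on the target strand and ignores complementarity and strand orientation, and it assumes (arbitrarily) that the fixed probe supplies the first five bases and the labeled probe the next five.

```python
from itertools import product


def all_kmers(k: int):
    """Complete set of k-mers over the four bases."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def ligation_products(target: str, k: int = 5):
    """Pairs (fixed, labeled) of k-mers that are adjacent somewhere in the target.

    Simplification for this sketch: probes are written in the sense of the target,
    so a pair is 'contiguous' when fixed + labeled occurs as a substring.
    """
    pairs = set()
    for start in range(len(target) - 2 * k + 1):
        fixed = target[start:start + k]
        labeled = target[start + k:start + 2 * k]
        pairs.add((fixed, labeled))
    return pairs


if __name__ == "__main__":
    fixed_set = all_kmers(5)
    labeled_set = all_kmers(5)
    print(len(fixed_set) * len(labeled_set))   # 1024 x 1024 combinations
    print(sorted(ligation_products("ACGTACGTACGT")))
```

Only 2048 short probes need to be made, yet every one of the 4^10 possible 10-mers can in principle be scored as a ligation product.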

Unlike conventional SBH, which only uses and assembles sequence
information from full match probes (i.e., probes that are perfectly matched to the
target nucleic acid), the present invention utilizes hybridization information
obtainable not only from probes that are perfectly matched to the target nucleic
acid sequence, but also from probes that have single, double or more mismatches
compared to the target.

There is typically a wide spread, or range, of hybridization (or
hybridization/ligation) signals for full match probe:target hybrids, due to
differences in hybrid stability, target accessibility and variations in probe quality
and quantity. In conventional SBH, a single threshold is set for a positive score,
meaning that probes with hybridization signals higher than the threshold are scored
as positive and probes with hybridization signals lower than the threshold are
scored as negative. Conventional procedures and algorithms for base calling
(determining the identity of the nucleotide at a particular position), either for
resequencing using known reference sequences or for de novo sequencing, use this
type of positive/negative data to maximize accuracy and read length.

However, there is also a wide range of signals for probe:target hybrids
containing a single base pair mismatch, depending on the location of the mismatch
and the type of mismatch, and there is a wide range of signals for probe:target
hybrids containing a double base pair mismatch, etc. Consequently, the range of
signals for full match probe:target hybrids often overlaps the range of signals for
mismatched probe:target hybrids. In other words, the lower part of the range of
signals for full match probe:target hybrids is often mixed with strong signals from
mismatched probe:target hybrids (usually single mismatches).

Thus, the setting of a single threshold may have the undesirable effects of
creating false positive probes (i.e., probes that form strongly hybridizing single
mismatch probe:target hybrids) and false negative probes (i.e., probes that form
weakly hybridizing full match probe:target hybrids). In addition, a system that
scores only full match probe:target hybrids ignores the useful sequence information
that can be provided by mismatched probe:target hybrids, particularly single
mismatches. For example, if 10-mer probes are being used, for all of the single
mismatch probes, 9 of the 10 bases will be a correct identification of the true base
at that position. The numerical scoring of all probes according to the relative level
or strength of hybridization allows the informational content of all probes to be
taken into account.
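
The 10-mer example can be checked by enumeration. The sketch below is illustrative only (the function names are not from the specification): it lists every probe that differs from a full match probe at exactly one position and confirms that each still agrees with the full match at 9 of 10 positions.

```python
def single_mismatch_probes(full_match: str):
    """All probes that differ from the full match probe at exactly one position."""
    variants = []
    for i, base in enumerate(full_match):
        for alt in "ACGT":
            if alt != base:
                variants.append(full_match[:i] + alt + full_match[i + 1:])
    return variants


def matching_positions(a: str, b: str) -> int:
    """Count positions at which two equal-length probes carry the same base."""
    return sum(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    probe = "ACGTACGTAC"  # arbitrary 10-mer chosen for illustration
    variants = single_mismatch_probes(probe)
    print(len(variants))                                       # 10 positions x 3 alternatives
    print({matching_positions(probe, v) for v in variants})    # every variant agrees at 9 bases
```

For each full match 10-mer there are 30 single mismatch probes, so a scoring scheme that discards them ignores many more partially informative readings than it keeps.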

The present invention provides methods for analyzing the sequence of a
target nucleic acid, comprising the steps of:

(a) contacting a target nucleic acid with a plurality of oligonucleotide
probes of predetermined length and predetermined sequence, wherein each probe
comprises an information region, under conditions which produce, on average,
relatively more probe:target hybrids per probe for probes that are perfectly
complementary in the information region of the probe than for probes that are
substantially perfectly complementary in the information region of the probe, and
relatively fewer probe:target hybrids that are significantly mismatched in the
information region of the probe;

(b) measuring the hybridization signal of said probe:target hybrids;

(c) assigning a numerical voting score to each probe or pool of probes
based on the relative strength of hybridization signal; and

(d) determining the sequence of the target nucleic acid, comprising the step
of analyzing the numerical voting scores of the probes in relation to their
sequences.

In step (a), the hybridization and/or wash conditions are selected to
produce, on average, a relatively higher number of probe:target hybrids for fully
matched (perfectly complementary) probes than for probes with a single mismatch
compared to target, and a relatively higher number of hybrids for single mismatch
probes compared to double mismatch probes, and a relatively higher number of
hybrids for double mismatch probes compared to triple mismatch probes, etc.

In step (b), the hybridization signal of the probe:target hybrids is measured 
and the relative level (or strength) of the signal is determined. 

In step (c), a numerical voting score (also referred to herein as "voting
power") is assigned to each probe or to a pool of probes based on the strength of
the hybridization signal. The assignment of scores is described below in more
detail in the section entitled "Assigning a Numerical Voting Score to Probes and
Sequence Analysis."

In step (d), the sequence may be analyzed by aligning all possible
sequences, summing the numerical voting scores of all probes voting for a
particular hypothetical base identity at a particular position, and determining which
of the hypothesized bases is correct (i.e., has the most votes). Alternatively, the
numerical voting score of the probes as assigned in step (c) may be further
modified, or weighted, by a voting factor as described in the section entitled
"Assigning a Numerical Voting Score to Probes and Sequence Analysis," wherein
the modification of the score for a probe depends on the relationship of that probe
to the hypothesized sequence.
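
One simplified reading of the summation in step (d) is sketched below. It is not the algorithm of Appendix A, which is not reproduced here: it assumes the flanking sequence is already known (as in resequencing), lets every k-mer that covers the position under a given hypothesis contribute its numerical score, and omits the optional voting factor. The function name and the toy scores are hypothetical.

```python
def call_base(flank_left: str, flank_right: str, scores: dict, k: int):
    """Pick the base whose hypothesis collects the highest summed voting score.

    scores maps probe sequence -> numerical voting score (e.g. normalized signal).
    Probes absent from the dictionary contribute nothing.
    """
    totals = {}
    pos = len(flank_left)  # index of the unknown base in the hypothesized sequence
    for base in "ACGT":
        hypothesis = flank_left + base + flank_right
        first = max(0, pos - k + 1)
        last = min(pos, len(hypothesis) - k)
        total = 0.0
        # Every k-mer window covering the position votes for this hypothesized base.
        for start in range(first, last + 1):
            total += scores.get(hypothesis[start:start + k], 0.0)
        totals[base] = total
    best = max(totals, key=totals.get)
    return best, totals


if __name__ == "__main__":
    # Toy scores: strong signals for probes matching ...ACGTA..., weaker signals
    # for probes that would match if the middle base were T instead of G.
    toy_scores = {"ACG": 0.9, "CGT": 0.8, "GTA": 0.7,
                  "ACT": 0.3, "CTT": 0.2, "TTA": 0.1}
    print(call_base("AC", "TA", toy_scores, k=3))
```

Because scores are summed rather than thresholded, a weak full match probe is not discarded outright, and a strong mismatch probe cannot by itself outvote several consistent probes.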

The advantages of this approach to sequence assembly are that it eliminates
the need for the separation of full match and mismatched probes and thus reduces
the number of false positives and false negatives, it obtains maximal benefit from the
informational content of probes that form single or double mismatches, and it uses
the majority of probes with readable matches, thus increasing the redundancy of
reads several-fold compared to using full match probes only.

Target Polynucleotide 

"Target nucleic acid" or "target polynucleotide" refers to the nucleic acid of
interest, typically the nucleic acid that is sequenced in the SBH assay. Potential
target polynucleotides include naturally occurring or artificially created DNA (e.g.,
genomic DNA and cDNA) and RNA (e.g., mRNA), including nucleic acids used as
part of DNA computing. The target nucleic acid may be composed of
ribonucleotides, deoxyribonucleotides or mixtures thereof. Typically, the target
nucleic acid is a DNA. While the target nucleic acid can be double-stranded, it is
preferably single stranded. The "read length" of the target nucleic acid can be any
number of nucleotides, depending on the length of the probes, but is typically on
the order of 100, 200, 400, 800, 1600, 3200, 6400, or even more nucleotides in
length, up to the entire human genome.

The target nucleic acid can be obtained from virtually any source and can
be prepared using methods known in the art. For example, target nucleic acids can
be isolated by PCR methodology, or by cloning into plasmids (for a convenient
target nucleic acid fragment length of 500 to 5,000 base pairs), or by cloning into
yeast or bacterial artificial chromosomes (for a convenient target nucleic acid
fragment length of up to 100 kb).

Depending on the desired length for use in the SBH assay, the target
nucleic acid may be sheared into fragments prior to use in an SBH assay.
Fragmentation may be accomplished by nonspecific endonuclease digestion,
restriction enzyme digestion (e.g., by CviJI), physical shearing (e.g., by
ultrasound) or NaOH treatment. Fragments may be separated by size (e.g., by gel
electrophoresis) to obtain the desired fragment length. Fragmentation of the target
nucleic acid also may avoid hindrance to hybridization from secondary structure in
the sample. The sizes of the target nucleic acid fragments used in the hybridization
reaction optimally range in length from slightly longer than the probe length to
twice the probe length, e.g., 10-100 or 10-40 bases.

Probes 

"Probes" refers to relatively short pieces of nucleic acids, preferably DNA.
Probes are preferably shorter than the target DNA by at least one base, and more
preferably they are 25 bases or fewer in length, still more preferably 20 bases or
fewer in length. Of course, the optimal length of a probe will depend on the length
of the target nucleic acid being analyzed. For a target nucleic acid composed of about
100 or fewer bases, the probes are at least 7-mers; for a target of about 100-200
bases, the probes are at least 8-mers; for a target nucleic acid of about 200-400
bases, the probes are at least 9-mers; for a target nucleic acid of about 400-800
bases, the probes are at least 10-mers; for a target nucleic acid of about 800-1600
bases, the probes are at least 11-mers; for a target of about 1600-3200 bases, the
probes are at least 12-mers; for a target of about 3200-6400 bases, the probes are
at least 13-mers; and for a target of about 6400-12,800 bases, the probes are at
least 14-mers. For every additional two-fold increase in the length of the target
nucleic acid, the optimal probe length is one additional base. Those of skill in the
art will recognize that for Format 3 SBH applications, the above-delineated probe
lengths are post-ligation. Thus, as used throughout, specific probe lengths refer to
the actual length of the probes for Format 1 and 2 SBH applications and the
lengths of ligated probes for Format 3 or Format 3-like SBH applications. Probes
are normally single stranded, although double-stranded probes may be used in
some applications.

Probes may be prepared using standard chemistry procedures known in the 
art. The length of the probes described above refers to the length of the 
informational content of the probes, not necessarily the actual physical length of 
the probes. The probes used in SBH frequently contain degenerate ends [e.g., one 
to three non-specified (mixed A,T,C and G) or universal (e.g. M base or inosine) 
bases at the ends] that aid hybridization but do not contribute to the information 
content of the probes. Hybridization discrimination of mismatches in these 
degenerate probe mixtures refers only to the length of the informational content, 
not the full physical length. For example, SBH applications frequently use 
mixtures of probes of the formula NxByNz, wherein N represents any of the four 
bases and varies for the polynucleotides in a given mixture, B represents any of the 
four bases but is the same for each of the polynucleotides in a given mixture, and x, 
y, and z are all integers. In this formula, Nx and Nz represent the degenerate ends 
of the probe and By represents the information content of the probe (e.g., a
uniquely arrayed probe in conventional SBH). 
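
To make the NxByNz notation concrete, the following sketch expands one informative core By into the mixture of physical sequences that share it. It is an illustration only; the function name is hypothetical, and universal bases such as inosine are not modeled (every N is simply expanded over A, C, G and T).

```python
from itertools import product


def expand_probe_mixture(core: str, x: int, z: int):
    """All physical sequences in an NxByNz mixture for a given informative core By.

    x and z are the numbers of degenerate (N) positions on each end; the core is
    identical in every member of the mixture.
    """
    left_ends = ["".join(p) for p in product("ACGT", repeat=x)]
    right_ends = ["".join(p) for p in product("ACGT", repeat=z)]
    return [left + core + right for left in left_ends for right in right_ends]


if __name__ == "__main__":
    mixture = expand_probe_mixture("ACGTA", x=1, z=1)  # N1 B5 N1: 7 bases long, 5 informative
    print(len(mixture))   # 4**(x + z) physical sequences
    print(mixture[:4])
```

The mixture contains 4^(x+z) physical sequences, but it is scored as a single probe whose informational length is y.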

The probes may consist solely of naturally-occurring nucleotides and native 
phosphodiester backbones, or the probes may be modified or tagged to enhance 
specificity of detection. For example, the probes may be composed of one or more 
modified bases, such as 7-deazaguanosine, or one or more modified backbone 
interlinkages, such as a phosphorothioate. The only requirement is that the probes 
be able to hybridize to the target nucleic acid. A wide variety of modified bases 
and backbone interlinkages that can be used in conjunction with the present 
invention are known, and will be apparent to those of skill in the art. 

Other variations include the use of modified oligonucleotides to increase
specificity or efficiency, and cycling hybridizations to increase the hybridization
signal, for example by performing a hybridization cycle under conditions (e.g.
temperature) optimally selected for a first set of labeled probes followed by
hybridization under conditions optimally selected for a second set of labeled
probes. Shifts in reading frame may be determined by using mixtures (preferably
mixtures of equimolar amounts) of probes ending in each of the four nucleotide
bases A, T, C and G.

The oligonucleotide probes are preferably labeled with identification tags to
enhance detection or discrimination. Suitable labels include fluorescent dyes,
chemiluminescent systems, radioactive labels (e.g., 35S, 3H, 32P or 33P), or isotopes
detectable by mass spectrometry, nanobeads, polymers or molecules of different
size, shape, electrical, magnetic or other properties, attached by any of a variety of
methods that are well known in the art.

A complete set of all possible probes of a given length (4^N, where N is the
length) or a subset of this complete set may be used in the hybridization step.
Probes of differing lengths may also be used. A large number of probes may be
synthesized in a small number of reactions. For example, a complete set of all
possible 10-mers (about 1 million probes) may be synthesized as follows. 1000
5-mers, each uniquely associated with a 10-digit DNA bar code, are synthesized in
1000 reactions and mixed. The mixture is divided into 1000 aliquots which then
undergo 1000 reactions, during which the informational length of the probe is
extended by an additional 5 nucleotides and the 10-digit barcode is extended by a
further 10 digits, to form 1 million uniquely tagged 10-mers synthesized in only
2000 reactions.
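
The split-and-extend bookkeeping can be sketched at small scale. The example below is an illustration under stated assumptions: barcodes are modeled as tuples of reaction indices rather than DNA bar codes, and the demonstration uses 2-mers so that it runs instantly. With 5-mers the same logic gives 4^5 = 1024 reactions per round, which the text rounds to 1000.

```python
from itertools import product


def split_and_extend(k: int):
    """Simulate two rounds of synthesis; return (tagged probes, number of reactions).

    Each probe is a (sequence, barcode) pair. The barcode here is a tuple of
    reaction indices, a hypothetical stand-in for the DNA bar code in the text.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    # Round 1: one reaction per k-mer, each paired with its own barcode; then mix.
    mixture = [(seq, (i,)) for i, seq in enumerate(kmers)]
    reactions = len(kmers)

    # Round 2: split the mixture into one aliquot per k-mer; each reaction extends
    # every probe in its aliquot by that k-mer and extends the barcode to match.
    extended = []
    for j, extension in enumerate(kmers):
        reactions += 1
        for seq, code in mixture:
            extended.append((seq + extension, code + (j,)))
    return extended, reactions


if __name__ == "__main__":
    probes, reactions = split_and_extend(2)
    print(len(probes), reactions)   # 4**4 tagged 4-mers from 2 * 4**2 reactions
```

The number of products grows as the square of the number of reactions per round, while the total number of reactions only doubles.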



Hybridization Reaction 

The number and type of probes that are used in each hybridization reaction
depend on the detection power of the reader, the statistics of numerical scoring,
including use of informative pools (or other pools), the length of target nucleic acid
sequence, and the SBH application (e.g., whether de novo sequencing,
resequencing or genotyping is desired). A complete set of all possible probe
sequences of the same length may be used, or only a portion of this complete set
may be used. Alternatively, probes of differing length may be used.

Hybridization Conditions 

Hybridization and washing conditions are selected to provide a gradation of
hybridization signals wherein full match probes have higher signals than single
mismatch probes, which in turn have higher signals than double mismatch probes,
which in turn have higher signals than triple mismatch probes, etc. Conditions may
be selected so as to detect substantially perfect match hybrids (such as those
wherein the fragment and probe hybridize at six out of seven positions).
Alternatively, slightly less stringent conditions than those that permit detection
only of perfect match hybrids may be used.

Suitable hybridization conditions may be routinely determined by 
optimization procedures or pilot studies. Such procedures and studies are 
routinely conducted by those skilled in the art to establish protocols for use in a 
laboratory. See e.g., Ausubel et al., Current Protocols in Molecular Biology, Vol.
1-2, John Wiley & Sons (1989); Sambrook et al., Molecular Cloning: A Laboratory
Manual, 2nd Ed., Vols. 1-3, Cold Spring Harbor Press (1989); and Maniatis et al.,
Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold
Spring Harbor, New York (1982), all of which are incorporated by reference
herein. For example, conditions such as temperature, concentration of 
components, hybridization and washing times, buffer components, and their pH and 
ionic strength may be varied. 

Assigning a Numerical Voting Score to Probes and Sequences

The probes which have hybridized to the target polynucleotide during the
hybridization reaction step can be assigned a numerical voting score (voting
power) based on the strength of the hybridization signal.

Data may be obtained by scoring each probe individually or by scoring
pools of probes, including informative pools as described in U.S. Application Serial
No. 60/115,284 entitled "Enhanced Sequencing by Hybridization Using
Informative Pools of Probes" filed January 6, 1999, incorporated herein by
reference.

Probes can be numerically scored and their sequences analyzed as follows.
Probes are sorted by descending hybridization signal value, and a numerical voting
score (voting power) is assigned to each probe based on the hybridization signal
and a converting function. One possible converting function involves dividing the
signal range into several segments (by setting a certain number of threshold steps)
and defining the voting power for probes in each segment as the inverse of the
number of probes in that segment. The lower boundary of the signal range that is
taken into account is set at least at the background level of signal, but may be set
higher as desired in order to simplify or speed computation without losing
significant information.
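
The converting function just described may be sketched as follows. This is a
minimal illustration, not the implementation used by the inventors. The function
name, the threshold values and the background level are assumptions chosen for
the example.

```python
from bisect import bisect_right


def voting_powers(signals, thresholds, background):
    """Return a dict mapping probe -> voting power.

    signals:    dict mapping probe -> hybridization signal
    thresholds: ascending signal values splitting the range into segments
                (assumed, user-chosen threshold steps)
    background: lower boundary; probes below it receive no vote
    """
    segment_of = {}
    counts = {}
    for probe, signal in signals.items():
        if signal < background:
            continue  # below the lower boundary of the signal range
        segment = bisect_right(thresholds, signal)
        segment_of[probe] = segment
        counts[segment] = counts.get(segment, 0) + 1
    # Voting power is the inverse of the number of probes in the same segment.
    return {probe: 1.0 / counts[seg] for probe, seg in segment_of.items()}


if __name__ == "__main__":
    example = {"AAAAA": 900, "AAAAC": 850, "AAAAG": 300,
               "AAAAT": 280, "AAACA": 260, "AAACC": 50}
    print(voting_powers(example, thresholds=[500], background=100))
```

Because strongly hybridizing segments usually contain few probes, each probe in
such a segment receives a larger share of the vote than a probe in a crowded
low-signal segment.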

Probe sequences are aligned allowing for mismatches. For base calling, for 
example in resequencing, the identity of a base at a particular position is voted on 
by the number of occurrences of a base in aligned probes in combination with the 
numerical voting score (or voting power) of each probe. The numerical voting 
scores of all probes voting for a particular hypothetical base identity at a particular
position may be summed, and the base identity may be confirmed by determining 
which of the hypothesized bases has the most votes. For example, because the 
voting power of a strongly hybridizing probe is higher, its vote as to the identity of 
the base is given more weight. 

Alternatively, the assigned numerical voting score of the probes may be
further modified, or weighted, by a voting factor, wherein the modification of the
score for a probe depends on the relationship of that probe to the hypothesized
sequence. For example, when 10-mers are used in resequencing, if the base at
position 100 is hypothesized to be an adenine (A), there will be 10 full match
(perfectly complementary) probes for the A at that position, and 270 (9 x 3 x 10)
single mismatch probes for the A at that position. The voting factor by which the
numerical voting score is modified may be set to a multiplier of 100 for full match
probes, a multiplier of 20 for single mismatch probes, and a multiplier of 2 for all
other probes. The votes are then summed after modification by the appropriate
voting factor, and the voting process is repeated for each of the four hypothesized
bases at position 100 (A, T, C and G). The hypothesized base that has the highest
number of votes may be declared the correct base. This voting process can be used
to solve single-position base problems or can be used to determine the identity of
two consecutive base positions (in which case there are sixteen, rather than four,
hypothesized two-base combinations).
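
The weighted voting described above may be sketched as follows, using the
multipliers 100, 20 and 2 given in the text. The sketch rests on several
assumptions that are not stated in the source: probes are written as the
target-strand sequence they detect, a probe is classified by its best alignment
among the windows covering the queried position, and a single mismatch counts
only when it lies away from the queried base (which reproduces the 9 x 3 x 10
count). All function names are hypothetical.

```python
FACTORS = {"full": 100, "single": 20, "other": 2}  # multipliers from the text


def probe_class(probe, hypothesis, position):
    """Classify a probe against the windows of `hypothesis` covering `position`."""
    k = len(probe)
    best = "other"
    first = max(0, position - k + 1)
    last = min(position, len(hypothesis) - k)
    for start in range(first, last + 1):
        window = hypothesis[start:start + k]
        diffs = [i for i in range(k) if probe[i] != window[i]]
        if not diffs:
            return "full"
        # A single mismatch counts only if it is not at the queried base.
        if len(diffs) == 1 and start + diffs[0] != position:
            best = "single"
    return best


def call_base(reference, position, scores):
    """Return (winning base, dict mapping base -> summed weighted votes)."""
    totals = {}
    for base in "ACGT":
        hypothesis = reference[:position] + base + reference[position + 1:]
        totals[base] = sum(
            FACTORS[probe_class(probe, hypothesis, position)] * score
            for probe, score in scores.items()
        )
    winner = max(totals, key=totals.get)
    return winner, totals


if __name__ == "__main__":
    # Toy example with 4-mers instead of 10-mers to keep the input short.
    scores = {"GTAC": 5.0, "TACG": 4.0, "GTCC": 1.0}
    print(call_base("ACGTACGTAC", 4, scores))
```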

However, as illustrated in Example 1 below, a correct solution requires an
absolute minimum number of votes. If none of the hypothetical base candidates
receives the minimum number of votes, then the process needs to be repeated. In
addition, a correct solution requires the "winning" candidate to have a sufficiently
high number of votes in comparison to the other candidates. In the case of a
heterozygous position (two genes are present, and each gene has a different base at
that position), there can be two "winning" candidates, but each of the "winners"
must still have a sufficiently high number of votes compared to the other
candidates.
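
One possible form of such a decision rule is sketched below. The thresholds
`min_votes` and `min_ratio`, the function name and the vote totals in the usage
lines are illustrative assumptions. The source requires only that the winner or
winners reach a minimum number of votes and sufficiently exceed the remaining
candidates.

```python
def decide(totals, min_votes, min_ratio):
    """Classify one position from its per-base vote totals.

    Returns ("undetermined", []), ("homozygous", [base]) or
    ("heterozygous", [base1, base2]).
    """
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    (b1, v1), (b2, v2), (_, v3) = ranked[0], ranked[1], ranked[2]
    if v1 < min_votes:
        return "undetermined", []  # no candidate reached the minimum
    if v1 >= min_ratio * v2:
        return "homozygous", [b1]  # one clear winner
    # Two winners: both reach the minimum and both clearly beat the third.
    if v2 >= min_votes and v2 >= min_ratio * v3:
        return "heterozygous", sorted([b1, b2])
    return "undetermined", []


if __name__ == "__main__":
    print(decide({"A": 2000, "C": 2000, "G": 2000, "T": 2000}, 5000, 2.0))
    print(decide({"A": 10000, "C": 1000, "G": 10000, "T": 1000}, 5000, 2.0))
    print(decide({"A": 12000, "C": 1000, "G": 900, "T": 800}, 5000, 2.0))
```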

Alternatively, the voting power of probes can be used to determine whether
a probe is a full match probe, for example, in de novo sequencing. The selected
probe in question is aligned with all probes that have one or more mismatches
(mismatches when compared to the probe in question). Statistics are applied that
take into account the voting power of each of the mismatched probes (and may
include the vote of the selected probe itself). If the selected probe is actually a full
match probe, probes that have a single mismatch compared to the selected probe
should still hybridize strongly. However, if the selected probe is actually a strongly
hybridizing single mismatched probe, probes that have a single mismatch compared
to the selected probe will be double mismatched probes compared to the target
nucleic acid sequence and thus will hybridize relatively more weakly. For example,
summing the numerical voting score (or the numerical voting score as modified by
voting factor) of all probes having a single mismatch in comparison to the selected
probe will indicate whether the selected probe is a full match.
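
The summation described above may be sketched as follows. The function names
and the toy scores are assumptions made for illustration. How large the sum
must be before a probe is accepted as a full match is left open, as it is in
the text.

```python
BASES = "ACGT"


def single_mismatch_neighbors(probe):
    """Yield every sequence that differs from `probe` at exactly one position."""
    for i, original in enumerate(probe):
        for base in BASES:
            if base != original:
                yield probe[:i] + base + probe[i + 1:]


def neighbor_support(probe, scores, include_self=False):
    """Sum the voting scores of all single-mismatch neighbors of `probe`.

    scores: dict mapping probe sequence -> numerical voting score;
    probes absent from the dict contribute zero.
    """
    total = sum(scores.get(n, 0.0) for n in single_mismatch_neighbors(probe))
    if include_self:
        total += scores.get(probe, 0.0)
    return total


if __name__ == "__main__":
    scores = {"ACGTA": 10.0, "ACGTC": 6.0, "ACGAA": 3.0, "CCGTA": 5.0}
    for candidate in ("ACGTA", "ACGTC"):
        print(candidate, neighbor_support(candidate, scores))
```

A candidate whose neighbors collectively score well is more likely to be a true
full match than one whose support comes from a single strong neighbor.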

Probes with end mismatches typically have a stronger hybridization signal 
than probes with internal mismatches and thus it may be desirable to set voting 
factors so that these probes are given relatively more voting power. 

For determining the correct full match probes among several closely related
probes that share an identical middle portion (especially important in de novo
sequencing), 5' to 3' positions of mismatch may be taken into account. For
example, bridging across a branching point of a repeated 6-mer sequence can be
done by sorting all probes by the central 6-mer and determining which of the
positive probes are full match probes. If, for example, there are 6 positive probes
sharing the same middle 6-mer sequence, and one assumes that only two probes are
true full match probes, then single and double mismatches may be taken into
account when voting for the full match probes. This approach has the advantages
of eliminating false positives and reducing the occurrence of false sequence
assembly.

The above described sequence analysis methods of the invention were 
carried out for resequencing of a 700 base pair fragment of the human apo-B gene 
in one cardiovascular patient as follows. 

DNA was prepared by PCR with one phosphorylated primer. Lambda
exonuclease was used to degrade the phosphorylated strand and the remaining
single stranded DNA was randomly fragmented by endonuclease DNAse I. The
target DNA was mixed with 16 pools containing 64 TAMRA labeled 5-mer probes
and hybridized to 4 HyChips each containing four 5-mer arrays. The hybridization
image was detected using a fluorescent scanner and a hybridization score for each
of about 16,000 test dots was determined using an image analysis program.

Probes were sorted by descending hybridization signal value, and a
numerical voting score was assigned to each probe based on its hybridization signal
and a converting function. In this case, for the top 2000 dots (each corresponding
to a pool of 64 pentamers scoring 64 10-mers), the probes were assigned an initial
numerical voting score that was equal to their hybridization signal, while all other
probes were assigned a numerical voting score of zero.
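
The converting function used in this example (signal retained for the strongest
dots, zero for the rest) may be sketched as follows. The function name and the
three-dot input are illustrative only; in the example above the cutoff was 2000
dots.

```python
def top_n_voting_scores(signals, n):
    """Keep the signal as the voting score for the n strongest dots.

    signals: dict mapping dot identifier -> hybridization signal
    Returns a dict with the same keys; dots outside the top n score zero.
    """
    ranked = sorted(signals, key=signals.get, reverse=True)
    keep = set(ranked[:n])
    return {dot: (signals[dot] if dot in keep else 0.0) for dot in signals}


if __name__ == "__main__":
    print(top_n_voting_scores({"dot1": 5.0, "dot2": 9.0, "dot3": 1.0}, n=2))
```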

The assigned numerical voting score of each probe was further weighted by
a voting factor. The voting factor was set to a multiplier of 100 for full match
probes and a multiplier of 1 for all other probes. The modified numerical voting
scores (modified by the appropriate voting factor) of all probes voting for a
particular hypothetical base identity at a particular position were summed, and the
base identity was confirmed by determining which of the hypothesized bases had
the most votes.

The results of the sequence analysis according to this voting schema are
shown in Figure 2. The figure depicts a 400-500 base segment of the 700 base pair
fragment sequenced. For each base position, all four nucleotide options were
tested. The sum of the votes is plotted for each nucleotide at each position, and
the points in the graph are marked with corresponding nucleotide letters. The
capital letters denote the apo-B reference sequence. At positions 485 and 486, the
sequence is undeterminable because none of the four bases received a minimum
number of votes (each base received a total score of approximately 2000 votes).
At position 471, there are two correct bases (G and A) that each received a
total score of approximately 10,000 (a 10-fold difference relative to the other two
base candidates), indicating that the sample is heterozygous.

Description of conventional Format 1 and ^

A. Assay format 

Format 1 SBH is appropriate for the simultaneous analysis of a large set of
samples. Parallel scoring of thousands of samples on large arrays may be performed
in thousands of independent hybridization reactions using small pieces of
membranes. The identification of DNA may involve 1-20 probes per reaction and
the identification of mutations may in some cases involve more than 1000 probes
specifically selected or designed for each sample. For identification of the nature of
the mutated DNA segments, specific probes may be synthesized or selected for
each mutation detected in the first round of hybridizations.

DNA samples may be prepared in small arrays which may be separated by
appropriate spacers, and which may be simultaneously tested with probes selected
from a set of oligonucleotides which may be arrayed in multiwell plates. Small
arrays may consist of one or more samples. DNA samples in each small array may
include mutants or individual samples of a sequence. Consecutive small arrays may
be organized into larger arrays. Such larger arrays may include replication of the
same small array or may include arrays of samples of different DNA fragments. A
universal set of probes includes sufficient probes to analyze a DNA fragment with
prespecified precision, e.g. with respect to the redundancy of reading each base
pair ("bp"). These sets may include more probes than are necessary for one
specific fragment, but may include fewer probes than are necessary for testing
thousands of DNA samples of different sequence.

DNA or allele identification and a diagnostic sequencing process may
include the steps of:

1) Selection of a subset of probes from a dedicated, representative or
universal set to be hybridized with each of a plurality of small arrays;

2) Adding a first probe to each subarray on each of the arrays to be
analyzed in parallel;

3) Performing hybridization and scoring of the hybridization results;

4) Stripping off previously used probes;

5) Repeating hybridization, scoring and stripping steps for the remaining
probes which are to be scored;

6) Processing the obtained results to obtain a final analysis or to determine
additional probes to be hybridized;

7) Performing additional hybridizations for certain subarrays; and

8) Processing complete sets of data and obtaining a final analysis.

This approach provides fast identification and sequencing of a small number
of nucleic acid samples of one type (e.g. DNA, RNA), and also provides parallel
analysis of many sample types in the form of subarrays by using a presynthesized
set of probes of manageable size. Two approaches have been combined to produce
an efficient and versatile process for the determination of DNA identity, for DNA
diagnostics, and for identification of mutations.

For the identification of known sequences, a small set of shorter probes
may be used in place of a longer unique probe. In this approach, although there
may be more probes to be scored, a universal set of probes may be synthesized to
cover any type of sequence. For example, a full set of 6-mers includes only 4,096
probes, and a complete set of 7-mers includes only 16,384 probes.

Full sequencing of a DNA fragment may be performed with two levels of
hybridization. One level is hybridization of a sufficient set of probes that cover
every base at least once. For this purpose, a specific set of probes may be
synthesized for a standard sample. The results of hybridization with such a set of
probes reveal whether and where mutations (differences) occur in non-standard
samples. To determine the identity of the changes, additional specific probes may
be hybridized to the sample.

In another variation, all probes from a universal set may be scored. A
universal set of probes allows scoring of a relatively small number of probes per
sample in a two step process without an undesirable expenditure of time. The
hybridization process may involve successive probings: a first step of computing
an optimal subset of probes to be hybridized first and then, on the basis of the
obtained results, a second step of determining additional probes to be scored from
among those in a universal set.

B. Sequence Assembly

In SBH sequence assembly, K-1 oligonucleotides which occur repeatedly
in analyzed DNA fragments due to chance or biological reasons may be subject to
special consideration. If there is no additional information, relatively small
fragments of DNA may be fully assembled inasmuch as every base pair is read
several times.

In the assembly of relatively longer fragments, ambiguities may arise due to
the repeated occurrence in a set of positively-scored probes of a K-1 sequence (i.e.,
a sequence shorter than the length of the probe). This difficulty does not exist if
mutated or similar sequences have to be determined. Knowledge of one sequence
may be used as a template to correctly assemble a sequence known to be similar
(e.g. by its presence in a database) by arraying the positive probes for the unknown
sequence to display the best fit on the template.

Within DNA, the location of certain probes may be interchangeable when
determined by overlapping the sequence data, resulting in an ambiguity as to the
position of the partial sequence. Although the sequence information is determined
by SBH, either: (i) long read length, single-pass gel sequencing at a fraction of the
cost of complete gel sequencing; or (ii) comparison to related sequences, may be
used to order hybridization data where such ambiguities ("branch points") occur.
In addition, segments in junk DNA (which is not found in genes) may be repeated
many times in tandem. Although the sequence of the segments is determined by
SBH, single-pass gel sequencing may be used to determine the number of tandem
repeats where tandemly-repeated segments occur. As tandem repeats occur rarely
in protein-encoding portions of a gene, the gel-sequencing step will be performed
only when a commercial value for the sequence is determined.



C. Sequencing



The use of an array of sample arrays avoids consecutive scoring of many
oligonucleotides on a single sample or on a small set of samples. This approach
allows the scoring of more probes in parallel by manipulation of only one physical
object. Subarrays of DNA samples 1000 bp in length may be sequenced in a
relatively short period of time. If the samples are spotted at 50 subarrays in an
array and the array is reprobed 10 times, 500 probes may be scored. In screening
for the occurrence of a mutation, approximately 335 probes may be used to cover
each base three times. If a mutation is present, several covering probes will be
affected. The use of information about the identity of negative probes may map the
mutation with a two base precision. To solve a single base mutation mapped with
this precision, an additional 15 probes may be employed. These probes cover any
base combination for two questionable positions (assuming that deletions and
insertions are not involved). These probes may be scored in one cycle on 50
subarrays which contain a given sample. In the implementation of a multiple label
color scheme (i.e., multiplexing), two to six probes, each having a different label
such as a different fluorescent dye, may be used as a pool, thereby reducing the
number of hybridization cycles and shortening the sequencing process.

In more complicated cases, there may be two close mutations or insertions.
They may be handled with more probes. For example, a three base insertion may be
solved with 64 probes. The most complicated cases may be approached by several
steps of hybridization, and the selecting of a new set of probes on the basis of
results of previous hybridizations.
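
The probe counts quoted in this section follow from simple arithmetic, sketched
below. The function names are hypothetical. Reading the figure of 15 probes as
the 16 possible two-base combinations minus the known reference combination is
an interpretation, not a statement from the source.

```python
def probes_scored(subarray_replicas: int, reprobings: int) -> int:
    """Probes scored when each replica subarray is probed once per cycle."""
    return subarray_replicas * reprobings


def probes_for_substitution(uncertain_positions: int) -> int:
    """Base combinations at the uncertain positions, excluding the reference."""
    return 4 ** uncertain_positions - 1


def probes_for_insertion(inserted_bases: int) -> int:
    """All possible sequences of the inserted bases."""
    return 4 ** inserted_bases


if __name__ == "__main__":
    print(probes_scored(50, 10))        # 50 subarrays reprobed 10 times
    print(probes_for_substitution(2))   # mutation mapped to two positions
    print(probes_for_insertion(3))      # three base insertion
```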

If subarrays to be analyzed include tens or hundreds of samples of one type,
then several of them may be found to contain one or more changes (mutations,
insertions, or deletions). For each segment where mutation occurs, a specific set of
probes may be scored. The total number of probes to be scored for a type of
sample may be several hundreds. The scoring of replica arrays in parallel facilitates
scoring of hundreds of probes in a relatively small number of cycles. In addition,
compatible probes may be pooled. Positive hybridizations may be assigned to the
probes selected to check particular DNA segments because these segments usually
differ in 75% of their constituent bases.

By using a larger set of longer probes, longer targets may be conveniently 
analyzed. These targets may represent pools of shorter fragments such as pools of 
exon clones. 



D. Identification of Heterozygotes

A specific hybridization scoring method may be employed to define the
presence of heterozygotes (sequence variants) in a genomic segment to be
sequenced from a diploid chromosomal set. Two variations are where: i) the
sequence from one chromosome represents a basic type and the sequence from the
other represents a new variant; or, ii) both chromosomes contain new, but different
variants. In the first case, the scanning step designed to map changes gives a
maximal signal difference of two-fold at the heterozygotic position. In the second
case, there is no masking, but a more complicated selection of the probes for the
subsequent rounds of hybridizations may be indicated.

Scoring two-fold signal differences required in the first case may be
achieved efficiently by comparing corresponding signals with controls containing
only the basic sequence type and with the signals from other analyzed samples.
This approach allows determination of a relative reduction in the hybridization
signal for each particular probe in a given sample. This is significant because
hybridization efficiency may vary more than two-fold for a particular probe
hybridized with different DNA fragments having the same full match target. In
addition, heterozygotic sites may affect more than one probe depending upon the
number of oligonucleotide probes. Decrease of the signal for two to four
consecutive probes produces a more significant indication of heterozygotic sites.
Results may be checked by testing with small sets of selected probes among which
one or a few probes are selected to give a full match signal which is on average
eight-fold stronger than the signals coming from mismatch-containing duplexes.

Partitioned membranes allow a very flexible organization of experiments to
accommodate relatively larger numbers of samples representing a given sequence
type, or many different types of samples represented with relatively small numbers
of samples. A range of 4-256 samples can be handled with particular efficiency.
Subarrays within this range of numbers of dots may be designed to match the
configuration and size of standard multiwell plates used for storing and labeling
oligonucleotides. The size of the subarrays may be adjusted for different numbers
of samples, or a few standard subarray sizes may be used. If all samples of a type
do not fit in one subarray, additional subarrays or membranes may be used and
processed with the same probes. In addition, by adjusting the number of replicas
for each subarray, the time for completion of the identification or sequencing
process may be varied.



Signature Analysis with SBH 

Obtaining information about the degree of hybridization exhibited for a set
of only about 200 oligonucleotide probes (about 5% of the effort required for
complete sequencing) defines a unique signature of each gene and may be used for
sorting the cDNAs from a library to determine if the library contains multiple
copies of the same gene. By such signatures, identical, similar and different
cDNAs can be distinguished and inventoried.



Format 3 Sequencing by Hybridization

In Format 3, a first set of oligonucleotide probes of known sequence is
immobilized on a solid support under conditions which permit them to hybridize
with nucleic acids having respectively complementary sequences. A labeled,
second set of oligonucleotide probes is provided in solution. Both within the sets
and between the sets the probes may be of the same length or of different lengths.
A nucleic acid to be sequenced or intermediate fragments thereof may be applied to
the first set of probes in double-stranded form (especially where a recA protein is
present to permit hybridization under non-denaturing conditions), or in
single-stranded form and under conditions which permit hybrids of different
degrees of complementarity (for example, under conditions which discriminate
between full match and one base pair mismatch hybrids). The nucleic acid to be
sequenced or intermediate fragments thereof may be applied to the first set of
probes before, after or simultaneously with the second set of probes. A ligase or
other means of causing chemical bond formation between adjacent, but not
between nonadjacent, probes may be applied before, after or simultaneously with
the second set of probes. After permitting adjacent probes to be chemically
bonded, fragments and probes which are not immobilized to the surface by
chemical bonding to a member of the first set of probes are washed away, for
example, using a high temperature (up to 100 degrees C) wash solution which
melts hybrids. The bound probes from the second set may then be detected using
means appropriate to the label employed (which may be chemiluminescent,
fluorescent, radioactive, enzymatic or densitometric, for example).

Herein, nucleotide bases "match" or are "complementary" if they form a stable
duplex by hydrogen bonding under specified conditions. For example, under
conditions commonly employed in hybridization assays, adenine ("A") matches
thymine ("T"), but not guanine ("G") or cytosine ("C"). Similarly, G matches C,
but not A or T. Other bases which will hydrogen bond in less specific fashion, such
as inosine or the Universal Base ("M" base, Nichols et al., 1994), or other modified
bases, such as methylated bases, for example, are complementary to those bases for
which they form a stable duplex under specified conditions. A probe is said to be
"perfectly complementary" or is said to be a "perfect match" if each base in the
probe forms a duplex by hydrogen bonding to a base in the nucleic acid to be
sequenced. Each base in a probe that does not form a stable duplex is said to be a
"mismatch" under the specified hybridization conditions.

A list of probes may be assembled wherein each probe is a perfect match to
the nucleic acid to be sequenced. The probes on this list may then be analyzed to
order them in maximal overlap fashion. Such ordering may be accomplished by
comparing a first probe to each of the other probes on the list to determine which
probe has a 3' end which has the longest sequence of bases identical to the
sequence of bases at the 5' end of a second probe. The first and second probes
may then be overlapped, and the process may be repeated by comparing the 5' end
of the second probe to the 3' end of all of the remaining probes and by comparing
the 3' end of the first probe with the 5' end of all of the remaining probes. The
process may be continued until there are no probes on the list which have not been
overlapped with other probes. Alternatively, more than one probe may be selected
from the list of positive probes, and more than one set of overlapped probes
("sequence nucleus") may be generated in parallel. The list of probes for either
such process of sequence assembly may be the list of all probes which are perfectly
complementary to the nucleic acid to be sequenced or may be any subset thereof.
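
A greedy form of the maximal-overlap ordering described above may be sketched
as follows. This is an illustration under simplifying assumptions and not the
inventors' implementation: probes are written 5' to 3' as target-strand
sequences, a single sequence nucleus is grown from the first probe on the list,
and ties are broken arbitrarily. The function names are hypothetical.

```python
def overlap(left, right):
    """Length of the longest proper suffix of `left` that is a prefix of `right`."""
    limit = min(len(left), len(right)) - 1
    for size in range(limit, 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def assemble(probes):
    """Grow one sequence nucleus by repeatedly adding the best-overlapping probe.

    Returns (assembled sequence, list of probes that could not be overlapped).
    """
    remaining = list(dict.fromkeys(probes))  # drop duplicates, keep order
    nucleus = remaining.pop(0)
    while remaining:
        right = max(remaining, key=lambda p: overlap(nucleus, p))  # 3' extension
        left = max(remaining, key=lambda p: overlap(p, nucleus))   # 5' extension
        r, l = overlap(nucleus, right), overlap(left, nucleus)
        if r == 0 and l == 0:
            break  # nothing overlaps: leave the rest unassembled
        if r >= l:
            nucleus += right[r:]
            remaining.remove(right)
        else:
            nucleus = left[:len(left) - l] + nucleus
            remaining.remove(left)
    return nucleus, remaining


if __name__ == "__main__":
    print(assemble(["GTTG", "ACGT", "TGCA", "CGTT", "TTGC"]))
```

A repeated K-1 sequence in the target would give two equally good extensions at
some step. That situation is the branch point ambiguity discussed in the text,
and this sketch does not resolve it.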

The 5' and 3' ends of sequence nuclei may be overlapped to generate longer 
stretches of sequence. Where ambiguities arise in sequence assembly due to the 
availability of alternative proper overlaps with probes or sequence nuclei, 
hybridization with longer probes spanning the site of overlap alternatives, 
competitive hybridization, ligation of alternative end to end pairs of probes 
spanning the site of ambiguity or single pass gel analysis (to provide an 
unambiguous framework for sequence assembly) may be used. 

By employing the above procedures, one may obtain any desired level of
sequence, from a pattern of hybridization (which may be correlated with the
identity of a nucleic acid sample to serve as a signature for identifying the nucleic
acid sample) to overlapping or non-overlapping probes up through assembled
sequence nuclei and on to complete sequence for an intermediate fragment or an
entire source DNA molecule (e.g. a chromosome).

Sequencing may generally comprise the following steps: 

(a) contacting an array of immobilized oligonucleotide probes with a nucleic acid
fragment under conditions effective to allow a fragment with a sequence
complementary to that of an immobilized probe to form a primary complex with
the immobilized probe such that the fragment has a hybridized and a
non-hybridized portion;

(b) contacting a primary complex with a set of labeled oligonucleotide probes in
solution under conditions effective to allow a primary complex including an
unhybridized sequence complementary to that of a labeled probe to hybridize to the
labeled probe, thereby forming a secondary complex wherein the fragment is
hybridized with both an immobilized probe and a labeled probe;

(c) removing from a secondary complex any labeled probe that has not hybridized
adjacent to an immobilized probe;

(d) detecting the presence of adjacent labeled and unlabeled probes by detecting
the presence of the label; and

(e) determining a nucleotide sequence of the fragment by connecting the known
sequence of the immobilized and labeled probes.
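
Step (e) may be illustrated for the case in which labeled probes are applied as
pools. A positive dot then identifies an immobilized probe together with a
pool, and scores every sequence formed by joining that immobilized probe to a
member of the pool. The Python sketch below is a hypothetical illustration. The
joining order (immobilized probe first) and all names are assumptions.

```python
def scored_sequences(fixed_probe, pool):
    """Sequences scored by one dot: the immobilized probe joined to each
    labeled probe of the pool hybridized to it."""
    return [fixed_probe + labeled for labeled in pool]


def joined_votes(dots, pools):
    """Spread each dot's vote over the sequences that dot scores.

    dots:  list of (immobilized probe, pool index, vote)
    pools: list of lists of labeled probe sequences
    Returns a dict mapping joined sequence -> vote.
    """
    votes = {}
    for fixed_probe, pool_index, vote in dots:
        for sequence in scored_sequences(fixed_probe, pools[pool_index]):
            votes[sequence] = vote
    return votes


if __name__ == "__main__":
    pools = [["AAAAA", "AAAAC"], ["CCCCC"]]
    dots = [("GGGGG", 0, 7.0), ("TTTTT", 1, 3.0)]
    print(joined_votes(dots, pools))
```

Every member of a positive pool inherits the vote of its dot, so only one of
the joined sequences need actually be present in the target. Voting across
overlapping sequences is what separates the true full matches from the rest.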

In this embodiment of SBH, ligation may be implemented by a chemical
ligating agent (e.g. water-soluble carbodiimide or cyanogen bromide). A ligase
enzyme, such as the commercially available T4 DNA ligase from T4 bacteriophage,
may be employed. The washing conditions which are selected to distinguish
between adjacent versus nonadjacent labeled and immobilized probes are selected
to make use of the difference in stability of continuously stacked or ligated adjacent
probes.

Numerous modifications and variations in the practice of the invention are
expected to occur to those skilled in the art upon consideration of the foregoing
description of the presently preferred embodiments thereof. Consequently, the
only limitations which should be placed upon the scope of the present invention are
those which appear in the appended claims.

#!/usr/leo/bin/perl
# Redirect library search paths when the /usr/leo tree is present.
BEGIN {
    if ( -d '/usr/leo' ) {
        my $length = $#INC;
        foreach my $i (0 .. $length) {
            $INC[$i] =~ s/local/leo/g;
        }
    }
}

use strict 'vars';

if (scalar @ARGV < 6) {
    die "Need ini-file, score-file, #dots, score-column, Title, DecisionRuleFile, [FullsRatio], [01-flag:off]\n";
}

# Command-line arguments and their defaults.
my $FullRatio = "inf";
$FullRatio = $ARGV[6] if ($ARGV[6]);
my $OnlyFulls = 0;
my $Dots = $ARGV[2];
my $Collumn = $ARGV[3];
if ($FullRatio eq "inf")
{
    $OnlyFulls = 1;
}
my $useZeroOne = 0;
$useZeroOne = $ARGV[7] if ($ARGV[7]);
my $Title = $ARGV[4];

# Decision-rule parameters (defaults set before the decision rule file is read).
my $NumCandidates = 5;
my $MinQuartiler = .12;
my $MinRatioHom = 2.0;
my $MinRatioWindowHet = .15;
my $MaxRatioRef = 0.5;
my $Start = 0;
my $End = -1;
my $LogScale = 0;
read_decision($ARGV[5]);
my $doHeterozygotes = 1;

# Output file name stems encode the run parameters.
my $File = "votes.$Dots.$FullRatio.$useZeroOne.$Collumn";
my $Driver = "driver.$Dots.$FullRatio.$useZeroOne.$Collumn";

process_ini($ARGV[0]);

sub process_ini {
    my ($ini_file) = @_;
    open(INI, $ini_file) || die "Unable to open $ini_file\n";
    my $flag = 0;
    my $type = "";
    my %data;
    my @Map;
    my @Pool;
    my @RawScores;
    # Read the ini file section by section; a "#Section" line starts a new
    # section and the following lines are stored as its numbered entries.
    while (<INI>)
    {
        chop $_;
        if (!/#/ && $flag)
        {
            $data{$type}{$flag} = $_;
            $flag++;
        }
        if (/#Maps/)
        {
            $flag = 1;
            $type = "map";
        }
        if (/#Pools/)
        {
            $flag = 1;
            $type = "pool";
        }
        if (/#Targets/)
        {
            $flag = 1;
            $type = "target";
        }
        if (/#Images/)
        {
            $flag = 1;
            $type = "image";
        }
        last if (/#Spiking/);
    }
    die "Not enough pool info\n" if (scalar keys %{$data{"pool"}} < 1);
    foreach $flag (keys %{$data{"pool"}})
    {
        $Pool[$flag] = read_pool($data{"pool"}{$flag});
        print STDERR "Pool $flag has ", scalar @{$Pool[$flag]}, " elements\n";
    }

#Read wild- type sequence 

my ($file) = split (/\s+/, $data{ "target" } {l} ) ; 

my $Seq=read_sequence($file) ; 

my $Scores = read_s cores ($ARGV [1] ) ; 

my ($ target , $pos) ; 

my %PoolInd; 

foreach my $pool (0..$#Pool) 

foreach my $lp (@{$Pool [$pool] } ) 
^ $ Poo 1 1 nd { $ lp } = $pool ; 

my $Full=getFulls ($Seq, \@Pool , \%PoolInd) ; 

    foreach $target (sort { $a <=> $b } keys %{$Scores})
    {
        my @Names = split(/\//, $data{"target"}{$target});
        my $name = $Names[$#Names];

        open(DUMP, ">$File.dump.$target");
        my ($probe, $pool, @scores);

        # Flatten the scores into [probe, pool, score] records
        foreach $probe (keys %{$Scores->{$target}})
        {
            foreach $pool (keys %{$Scores->{$target}{$probe}})
            {
                push @scores, new_score($probe, $pool, $Scores->{$target}{$probe}{$pool});
            }
        }

        # Highest score first
        @scores = sort { $b->[2] <=> $a->[2] } @scores;

        if ($Dots eq "best")
        {
            my %Rank;
            foreach my $rank (0 .. $#scores)
            {
                $Rank{$scores[$rank]->[0]}{$scores[$rank]->[1]} = $rank;
            }

            # Ranks of the full-match probes of the reference sequence
            my @Rank;
            foreach my $pos (0 .. length($Seq) - 10)
            {
                my $fp = substr($Seq, $pos, 5);
                my $pool = $PoolInd{substr($Seq, $pos + 5, 5)};
                push @Rank, $Rank{$fp}{$pool};
            }
            @Rank = sort { $a <=> $b } @Rank;

            # Choose the cutoff rank that maximizes the excess of full matches
            my $cum = 0;
            my $max = 0;
            my $bestpos = 0;
            my $numFulls = length($Seq) - 9;
            foreach my $rank (@Rank)
            {
                $cum++;
                my $x = ($cum / $numFulls) - $rank / (scalar @scores);
                if ($x > $max)
                {
                    $max = $x;
                    $bestpos = $rank;
                }
            }
            $Dots = $bestpos + 1;
            @scores = @scores[0 .. $Dots - 1];
        }
        elsif ($Dots ne "inf")
        {
            @scores = @scores[0 .. $Dots - 1];
        }
        # Assign votes to every 10-mer represented by a retained probe/pool score
        my %Tens;
        foreach my $score (@scores)
        {
            my $Votes = 1;
            $Votes = $score->[2] if (!$useZeroOne);
            $Votes = log($score->[2]) / log(2) if ($LogScale && !$useZeroOne);
            foreach my $lp (@{$Pool[$score->[1]]})
            {
                $Tens{$score->[0] . $lp} = $Votes;
            }
        }

        my %refScores;
        my %bestScores;
        my %ratios;
        my %Call;

        # First pass: call reference (R) and homozygous mutant (M) positions
        foreach $pos ($Start .. length($Seq) + $End)
        {
            my $theChar = substr($Seq, $pos, 1);
            my $baseVote = getVotes($pos, $Seq, $Full, \%Tens);
            my ($refScore, $bestSol, $arrRef) = sortVotes($theChar, $baseVote);
            foreach my $ch (@{$arrRef})
            {
                $bestScores{$pos}{$ch} = $baseVote->{$ch};
            }
            my $ratio = $bestSol / ($refScore + .01);
            $refScores{$pos} = $refScore;
            $ratios{$pos} = $ratio;
            if ($ratio < $MaxRatioRef)
            {
                $Call{$pos} = "R";
            }
            if ($ratio > $MinRatioHom)
            {
                $Call{$pos} = "M";
            }
        }

        # Second pass: heterozygote (H) calls, skipping the lowest reference scores
        my @quartile = sort { $refScores{$a} <=> $refScores{$b} } keys %refScores;
        foreach my $i (int($MinQuartile * $#quartile) .. $#quartile)
        {
            my $pos = $quartile[$i];
            if (abs($ratios{$pos} - 1.0) <= $MinRatioWindowHet)
            {
                if ($Call{$pos} eq "")
                {
                    $Call{$pos} = "H";
                }
            }
        }

        # Anything still uncalled is marked X
        foreach $pos (keys %ratios)
        {
            if ($Call{$pos} eq "")
            {
                $Call{$pos} = "X";
            }
        }

        # Write the per-position vote table
        my %bestChar;
        foreach $pos ($Start .. length($Seq) + $End)
        {
            my $theChar = substr($Seq, $pos, 1);
            my $baseVote = getVotes($pos, $Seq, $Full, \%Tens);
            my ($refScore, $bestSol, $arrRef) = sortVotes($theChar, $baseVote);
            printf DUMP "%d %s %8.3f ", $pos, $theChar, $refScore;
            $bestChar{$pos} = $arrRef->[0];
            foreach my $ch (@{$arrRef})
            {
                my $score;
                $score = sprintf "%8.3f", $baseVote->{$ch};
                $score = " NA " if ($baseVote->{$ch} == -1);
                printf DUMP "%s %s ", $ch, $score;
                $bestScores{$pos}{$ch} = $baseVote->{$ch};
            }
            my $ratio = $bestSol / ($refScore + .01);
            $refScores{$pos} = $refScore;
            $ratios{$pos} = $ratio;
            printf DUMP "%5.3f %s\n", $ratio, $Call{$pos};
        }
        close(DUMP);

        # Write an S-PLUS driver script and run it to plot the dump file
        my $driver = $Driver . ".$target";
        open(DRIVER, ">$driver");
        print DRIVER "source(\"/f3analysis/NewMutScan/MutScan/seqPlot.splus\")\n";
        print DRIVER "coryPS(\"$File.dump.$target\", out=\"$File.$target.ps\", tit=\"$Title Targ=$target $name\")\n";
        close DRIVER;

        my $dir = `pwd`;
        system("/usr/leo/splus/Splus BATCH $driver log");
        # Summary report of parameters and calls
        open(REPORT, ">$File.rep.$target");
        print REPORT "Title=$Title Target=$target\n";
        print REPORT "Dots=$Dots Scale=$FullRatio use0/1: $useZeroOne ScoreCollumn: $Collumn\n";
        print REPORT "Start=$Start End=$End\n";
        print REPORT "RefThresh: $MaxRatioRef, HomThresh: $MinRatioHom, HetQuartile: $MinQuartile HetWindow: $MinRatioWindowHet Candidates: $NumCandidates\n";

        my %count;
        foreach $pos ($Start .. length($Seq) + $End)
        {
            $count{$Call{$pos}}++;
            if (($Call{$pos} eq "M") || ($Call{$pos} eq "H"))
            {
                print REPORT "$Call{$pos} $pos ", substr($Seq, $pos, 1), "-", $bestChar{$pos}, "\n";
            }
        }

        print REPORT "---\nType # %\n";
        foreach my $call (qw(R M H X))
        {
            printf REPORT "%s %3d %5.3f\n", $call, $count{$call} + 0, $count{$call} / (scalar keys %Call);
        }
        close(REPORT);

        # Examine homozygotes: the highest-ratio positions above the threshold
        my @pos = sort { $ratios{$b} <=> $ratios{$a} } (keys %ratios);
        my @poses;
        foreach my $i (0 .. $NumCandidates - 1)
        {
            push @poses, $pos[$i] if ($ratios{$pos[$i]} >= $MinRatioHom);
        }

        print STDERR "T=$target Examining ", scalar @poses, " homozygote mutations at bases @poses\n";

        open(HOMO, ">$File.hom.$target");
        print HOMO "pos char ratio refRegion bestRegion discrim discrimRatio refRatio\n";

        foreach $pos (@poses)
        {
            my $min = $pos - 9;
            $min = 0 if ($min < 0);
            my $max = $pos + 9;
            $max = length($Seq) - 10 if ($max > (length($Seq) - 10));
            my $votesRef = 0;
            my $votesBest = 0;

            # Weighted votes over the surrounding window, reference sequence
            foreach my $pos2 ($min .. $max)
            {
                my $weight = 1 - abs($pos2 - $pos) * .1;
                $votesRef += $refScores{$pos2} * $weight;
                my @best = sort { $bestScores{$pos2}{$b} <=> $bestScores{$pos2}{$a} } keys %{$bestScores{$pos2}};
                my $best = $bestScores{$pos2}{$best[0]};
                $votesBest += $best * $weight;
            }
            my $refDiscrim = $votesRef / $votesBest;
            my $theChar = substr($Seq, $pos, 1);

            # Repeat the window vote for each candidate base at this position
            foreach my $Ch (qw(A C T G))
            {
                my $mut = $Seq;
                substr($mut, $pos, 1) = $Ch;
                my $Full2 = getFulls($mut, \@Pool, \%PoolInd);
                my $mutVotes = 0;
                my $mutBest = 0;
                foreach my $pos2 ($min .. $max)
                {
                    my $baseVote = getVotes($pos2, $mut, $Full2, \%Tens);
                    my ($refScore, $bestSol, $arrRef) = sortVotes(substr($mut, $pos2, 1), $baseVote);
                    my $weight = 1 - abs($pos2 - $pos) * .1;
                    $mutVotes += $refScore * $weight;
                    $mutBest += $bestSol * $weight;
                }
                print HOMO "$pos $Ch ";
                printf HOMO "%5.3f ", $bestScores{$pos}{$Ch} / ($refScores{$pos} + .01);
                printf HOMO "%8.3f %8.3f ", $mutVotes, $mutBest;
                my $discrim = $mutVotes / ($mutBest + .01);
                my $ratioToRef = $mutVotes / ($votesRef + .01);
                my $discrimRatio = $discrim / ($refDiscrim + .01);
                printf HOMO "%5.3f %5.3f %5.3f\n", $discrim, $discrimRatio, $ratioToRef;
            }
        }
        close(HOMO);

        next if (!$doHeterozygotes);

        # Examine heterozygotes
        my @refs = sort { $refScores{$a} <=> $refScores{$b} } keys %refScores;
        my @poses;
        my @pose;
        foreach my $i (int($MinQuartile * $#refs) .. $#refs)
        {
            my $pos = $refs[$i];
            push @pose, $pos if (abs($ratios{$pos} - 1.0) <= $MinRatioWindowHet);
        }

        # Keep the candidates whose ratio is closest to 1.0
        @poses = sort { abs($ratios{$a} - 1.0) <=> abs($ratios{$b} - 1.0) } @pose;
        my $min = $#poses;
        $min = $NumCandidates - 1 if ($NumCandidates <= $min);
        @poses = @poses[0 .. $min];

        print STDERR "T=$target Examining ", scalar @poses, " heterozygote mutations at bases @poses\n";

        open(HETER, ">$File.het.$target");
        print HETER "pos char ratio refRegion bestRegion discrim discrimRatio refRatio\n";

        foreach $pos (@poses)
        {
            my $min = $pos - 9;
            $min = 0 if ($min < 0);
            my $max = $pos + 9;
            $max = length($Seq) - 10 if ($max > (length($Seq) - 10));
            my $votesRef = 0;
            my $votesBest = 0;

            # Weighted votes over the surrounding window, reference sequence
            foreach my $pos2 ($min .. $max)
            {
                my $weight = 1 - abs($pos2 - $pos) * .1;
                $votesRef += $refScores{$pos2} * $weight;
                my @best = sort { $bestScores{$pos2}{$b} <=> $bestScores{$pos2}{$a} } keys %{$bestScores{$pos2}};
                my $best = $bestScores{$pos2}{$best[0]};
                $votesBest += $best * $weight;
            }
            my $refDiscrim = $votesRef / $votesBest;
            my $theChar = substr($Seq, $pos, 1);

            # Repeat the window vote for each candidate base at this position
            foreach my $Ch (qw(A C T G))
            {
                my $mut = $Seq;
                substr($mut, $pos, 1) = $Ch;
                my $Full2 = getFulls($mut, \@Pool, \%PoolInd);
                my $mutVotes = 0;
                my $mutBest = 0;
                foreach my $pos2 ($min .. $max)
                {
                    my $baseVote = getVotes($pos2, $mut, $Full2, \%Tens);
                    my ($refScore, $bestSol, $arrRef) = sortVotes(substr($mut, $pos2, 1), $baseVote);
                    my $weight = 1 - abs($pos2 - $pos) * .1;
                    $mutVotes += $refScore * $weight;
                    $mutBest += $bestSol * $weight;
                }
                print HETER "$pos $Ch ";
                printf HETER "%5.3f ", $bestScores{$pos}{$Ch} / ($refScores{$pos} + .01);
                printf HETER "%8.3f %8.3f ", $mutVotes, $mutBest;
                my $discrim = $mutVotes / ($mutBest + .01);
                my $ratioToRef = $mutVotes / ($votesRef + .01);
                my $discrimRatio = $discrim / ($refDiscrim + .01);
                printf HETER "%5.3f %5.3f %5.3f\n", $discrim, $discrimRatio, $ratioToRef;
            }
        }
        close(HETER);
    }
}


# Read the wild-type sequence file and return its reverse complement
sub read_sequence {
    my ($file) = @_;
    if (!-e $file)
    {
        die "Unable to read wild-type sequence $file\n";
    }
    open(WT, $file);
    my $Seq;
    while (<WT>)
    {
        next if (/#/);
        next if (/\+/);
        chop $_;
        $Seq .= uc($_);
    }
    return rc($Seq);
}


# Reverse complement of a DNA string
sub rc {
    my $seq = shift;
    my %rc = ("A", "T", "T", "A", "C", "G", "G", "C");
    my @chars = split(//, $seq);
    my $rc = "";
    for (my $i = $#chars; $i >= 0; $i--)
    {
        $rc .= $rc{$chars[$i]};
    }
    return $rc;
}

# Read one pool file; the second column of each data line is a probe
sub read_pool {
    my ($poolfile) = @_;
    open(POOL, $poolfile) || die "Could not read pool $poolfile\n";
    my ($junk, $probe, @pool);
    while (<POOL>)
    {
        next if (/>./);
        next if (/Group/);
        chop $_;
        ($junk, $probe) = split(/\s+/, $_);
        push @pool, $probe;
    }
    close(POOL);
    return \@pool;
}

# Read the score file into $Scores{target}{probe}{pool}
sub read_scores {
    my $file = shift;
    open(SCORE, $file);
    my %Scores;
    while (<SCORE>)
    {
        chop $_;
        my ($target, $probe, $pool, @scores) = split(/\s+/, $_);
        $Scores{$target}{$probe}{$pool} = $scores[$Collumn];
    }
    return (\%Scores);
}

sub new_score {
    my @sc = @_;
    return \@sc;
}

# Split the votes into the reference score and the ranked alternatives
sub sortVotes {
    my ($refChar, $voteRef) = @_;
    my $refScore = $voteRef->{$refChar};
    delete $voteRef->{$refChar};
    my @ch = sort { $voteRef->{$b} <=> $voteRef->{$a} } keys %{$voteRef};
    my $bestSol = $voteRef->{$ch[0]};
    my @subs;
    # Upper-case keys (substitutions) first, then lower-case keys
    foreach my $ch (@ch)
    {
        if ($ch eq uc($ch))
        {
            push @subs, $ch;
        }
    }
    foreach my $ch (@ch)
    {
        if ($ch eq lc($ch))
        {
            push @subs, $ch;
        }
    }
    pop @subs;
    return ($refScore, $bestSol, \@subs);
}

# Sum the votes for each candidate change at position $pos:
# A/C/G/T = substitution, a/c/g/t = insertion, d = deletion
sub getVotes {
    my ($pos, $seq, $Full, $Tens) = @_;
    my %baseVote;
    my $min = $pos - 9;
    $min = 0 if ($min < 0);
    my $max = $pos;
    $max = length($seq) - 10 if ($max > (length($seq) - 10));
    my $theChar = substr($seq, $pos, 1);
    foreach my $ch (qw(A C G T a c g t d))
    {
        my $mut = $seq;
        if ($ch eq uc($ch)) {
            substr($mut, $pos, 1) = $ch;
        }
        elsif ($ch eq "d") {
            substr($mut, $pos, 1) = "";
            if (substr($seq, $pos, 1) eq substr($seq, $pos + 1, 1))
            {
                $baseVote{$ch} = -1;
                next;
            }
        }
        else {
            substr($mut, $pos, 0) = uc($ch);
            if (substr($seq, $pos, 1) eq uc($ch))
            {
                $baseVote{$ch} = -2;
                next;
            }
        }
        $baseVote{$ch} = 0.0;
        foreach my $pos2 ($min .. $max)
        {
            my $probe = substr($mut, $pos2, 10);
            if (($ch eq $theChar) || (!$Full->{$probe}))
            {
                if ($OnlyFulls)
                {
                    $baseVote{$ch} += $Tens->{$probe};
                }
                else
                {
                    $baseVote{$ch} += $Tens->{$probe} * $FullRatio;
                }
            }
            next if ($OnlyFulls);
            # Also count single-mismatch variants of the probe
            foreach my $pos3 (0 .. 9)
            {
                next if ($pos3 == ($pos - $pos2));
                my $mutprobe = $probe;
                foreach my $ch2 (qw(A C G T))
                {
                    substr($mutprobe, $pos3, 1) = $ch2;
                    $baseVote{$ch} += $Tens->{$mutprobe} if (!$Full->{$mutprobe});
                }
            }
        }
    }
    return \%baseVote;
}

# Mark every 10-mer (fixed 5-mer plus any 5-mer in the matching pool)
# that reads as a full match to $seq
sub getFulls {
    my %Full;
    my ($seq, $poolRef, $poolInd) = @_;
    foreach my $pos (0 .. length($seq) - 10)
    {
        my $fp = substr($seq, $pos, 5);
        my $lp = substr($seq, $pos + 5, 5);
        my $pool = $poolInd->{$lp};
        foreach $lp (@{$poolRef->[$pool]})
        {
            $Full{$fp . $lp} = 1;
        }
    }
    return \%Full;
}

# Override the decision-rule defaults from a file of "keyword value" lines
sub read_decision {
    my $file = shift;
    open(DEC, $file) || die "Unable to open decision rule file: $file\n";
    while (<DEC>)
    {
        chop $_;
        my @fields = split(/\s+/, $_);
        if ($fields[0] =~ /log/)
        {
            $LogScale = $fields[1];
        }
        if ($fields[0] =~ /candidates/)
        {
            $NumCandidates = $fields[1];
        }
        if ($fields[0] =~ /quartile/)
        {
            $MinQuartile = $fields[1];
        }
        if ($fields[0] =~ /window/)
        {
            $MinRatioWindowHet = $fields[1];
        }
        if ($fields[0] =~ /ref/)
        {
            $MaxRatioRef = $fields[1];
        }
        if ($fields[0] =~ /hom/)
        {
            $MinRatioHom = $fields[1];
        }
        if ($fields[0] =~ /start/)
        {
            $Start = $fields[1];
        }
        if ($fields[0] =~ /end/)
        {
            $End = $fields[1];
        }
    }
}
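The threshold test that the listing applies to each position's vote ratio can be isolated as a short sketch. It is illustrative only: the helper names call_position and vote_ratio are hypothetical, the default thresholds are those read from the listing (0.5 for the reference threshold is an assumed reading), and the reference-score quartile filter that the listing applies before making heterozygote calls is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the listing): ratio of the best
# alternative vote to the reference vote, with the same 0.01 offset
# the listing uses to avoid division by zero.
sub vote_ratio {
    my ($best_alt, $ref_score) = @_;
    return $best_alt / ($ref_score + 0.01);
}

# Hypothetical helper: classify one position from its vote ratio.
# Defaults mirror the listing (0.5 is an assumed reading); the
# quartile filter on reference scores is deliberately omitted.
sub call_position {
    my ($ratio, %opt) = @_;
    my $max_ratio_ref = exists $opt{ref}    ? $opt{ref}    : 0.5;
    my $min_ratio_hom = exists $opt{hom}    ? $opt{hom}    : 2.0;
    my $het_window    = exists $opt{window} ? $opt{window} : 0.15;

    return "R" if $ratio < $max_ratio_ref;           # reference base supported
    return "M" if $ratio > $min_ratio_hom;           # homozygous mutation candidate
    return "H" if abs($ratio - 1.0) <= $het_window;  # heterozygote candidate
    return "X";                                      # no call
}

# Each pair is [best alternative vote, reference vote].
foreach my $pair ([1, 9.99], [30, 9.99], [10, 9.99], [7, 9.99]) {
    my $ratio = vote_ratio(@$pair);
    printf "%.2f -> %s\n", $ratio, call_position($ratio);
}
```

With the default thresholds the R and M tests cannot both succeed, so the order of the early returns gives the same result as the listing's sequence of assignments. Thresholds can be overridden per call, for example call_position($ratio, window => 0.4).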
WHAT IS CLAIMED IS:

1. A method of sequencing a target nucleic acid, comprising:

(a) contacting a target nucleic acid with a plurality of oligonucleotide 
probes of predetermined length and predetermined sequence, wherein each probe 
comprises an information region, under conditions which produce, on average, 
relatively more probe:target hybrids per probe for probes that are perfectly 
complementary in the information region of the probe than for probes that are 
substantially perfectly complementary in the information region of the probe, and 
relatively fewer probe:target hybrids that are significantly mismatched in the 
information region of the probe; 

(b) measuring the hybridization signal of said probes with the target 
nucleic acid; and 

(c) assigning a numerical voting score to each probe or pool of probes 
based on the relative strength of hybridization signal; and 

(d) determining the sequence of the target nucleic acid, comprising the step 
of summing the numerical voting scores of the probes in relation to their 
sequences. 



2. The method of claim 1 further comprising the step of modifying 
said numerical voting score assigned in step (c) by a voting factor selected based 
on the relationship of the probe to hypothetical target nucleic acid sequence. 

3. A method of sequencing a target nucleic acid, comprising:

(a) contacting a target nucleic acid with two sets of a plurality of
oligonucleotide probes of predetermined length and predetermined sequence,
wherein each probe comprises an information region, and wherein one set of the 
probes are attached to a fixed support, under conditions which produce, on 
average, relatively more probe:target hybrids per probe for probes that are 
perfectly complementary in the information region of the probe than for probes 
that are substantially perfectly complementary in the information region of the
probe, and relatively fewer probe:target hybrids that are significantly mismatched 
in the information region of the probe; 

(b) covalently joining probes that form contiguous probe:target hybrids 
that are capable of ligation; 

(c) measuring the hybridization signal of said covalently joined probes 
with the target nucleic acid; and 

(d) assigning a numerical voting score to each covalently joined probe or 
pool of covalently joined probes based on the relative strength of hybridization 
signal; and 

(e) determining the sequence of the target nucleic acid, comprising the step 
of summing the numerical voting scores of the probes in relation to their 
sequences. 

4. The method of claim 3 further comprising the step of modifying 
said numerical voting score assigned in step (d) by a voting factor selected based 
on the relationship of the probe to hypothetical target nucleic acid sequence. 

5. The method of claim 3 wherein the probes not attached to a fixed 
support are labeled, and wherein the label is detected. 

6. The method of claim 1 or 3 wherein the probes are divided into 
pools or subpools and all probes within each pool or subpool are associated with 
an identification tag unique to the pool or subpool. 

7. The method of claim 6 wherein the identification tag is a 
fluorescent label. 

8. An improved method of sequencing by hybridization using subsets
of probes that have been labeled with different tags and pooled in one or more 
pools by mixing all or a number of probes from each subset. 